Abstract

Crowdsourcing platforms are now extensively used for conducting subjective 
pairwise comparison studies. In this setting, a pairwise comparison dataset is 
typically gathered via random sampling, either with or without replacement. 
In this paper, we use tools from random graph theory to analyze these two 
random sampling methods for the HodgeRank estimator. Using the Fiedler 
value of the graph as a measurement for estimator stability (informativeness), 
we provide a new estimate of the Fiedler value for these two random graph 
models. In the asymptotic limit as the number of vertices tends to infinity, we 
prove the validity of the estimate. Based on our findings, for a small number 
of items to be compared, we recommend a two-stage sampling strategy where 
a greedy sampling method is used initially and random sampling without 
replacement is used in the second stage. When a large number of items is 
to be compared, we recommend random sampling with replacement as this 
is computationally inexpensive and trivially parallelizable. Experiments on 
synthetic and real-world datasets support our analysis. 
1. Introduction


With the advent of ubiquitous internet access and the growth of
crowdsourcing platforms (e.g., MTurk, InnoCentive, CrowdFlower, CrowdRank,
and AllOurIdeas), the crowdsourcing strategy is now employed by a variety
of communities. Crowdsourcing enables researchers to conduct social
experiments on a heterogeneous set of participants and at a lower economic
cost than conventional laboratory studies. For example, researchers can
harness internet users to conduct user studies on their personal computers.
Among various approaches to conduct subjective tests, pairwise comparisons
are expected to yield more reliable results. However, in crowdsourced
studies, the individuals performing the ratings are more diverse than in
controlled settings, which is difficult to control for using traditional
experimental designs. Researchers have therefore recently proposed several
randomized methods to conduct user studies [1, 2, 3], which accommodate
incomplete and imbalanced data.

HodgeRank, an application of combinatorial Hodge theory to the preference
or rank aggregation problem from pairwise comparison data that is possibly
incomplete and imbalanced, was first introduced by [4] and inspired a
series of studies in statistical ranking [5, 6, 7, 8]. Hodge theory has also
found applications in game theory [9] and computer vision [10, 11], in
addition to traditional applications such as fluid mechanics [12]. HodgeRank
formulates the ranking problem in terms of the discrete Hodge decomposition
of the pairwise data and shows that it can be decomposed into three
orthogonal components: a gradient flow representing a global rating (optimal
in the $L_2$-norm sense), a triangular curl flow representing local
inconsistency, and a harmonic flow representing global inconsistency. Such a
perspective generalizes various linear statistical models to provide a
universal geometric description of the structure of paired comparison data,
which is possibly incomplete and imbalanced due to crowdsourcing.

The two most popular random sampling schemes in crowdsourcing experiments
are random sampling with replacement and random sampling without
replacement. In random sampling with replacement, one selects a comparison
pair randomly from the whole dataset regardless of whether the pair has been
selected before; hence it is memory-free. In random sampling without
replacement, each comparison pair in the dataset has an equal chance of
being selected; once selected, it cannot be chosen again until all possible
pairs have been chosen. The simplest model of random sampling without
replacement in paired comparisons is the Erdős-Rényi random graph, which is
a stochastic process that starts with $n$ vertices and no edges, and at each
step adds one new edge uniformly [13]. As one needs to avoid previous edges,
such a sampling scheme is not memory-free and may lead to weak dependence in
some estimates.

Recently, [2, 3] developed the application of HodgeRank with random graph
designs in subjective Quality of Experience (QoE) evaluation and showed that
random graphs could play an important role in guiding random sampling
designs for crowdsourcing experiments. In particular, exploiting the
topology evolution of clique complexes induced from Erdős-Rényi random
graphs [14], [3] shows that at least $O(n \log n)$ distinct random edges are
necessary to ensure the inference of a global ranking and $O(n^{3/2})$
distinct random edges are sufficient to remove the global inconsistency.

On the other hand, there are active sampling schemes which are designed
to maximize the information in the collected dataset, potentially reducing
the amount of data collected. Recently, [7, 8] exploited a greedy sampling
method to maximize the Fisher information in HodgeRank, which is equivalent
to maximizing the smallest nonzero eigenvalue of the unnormalized graph
Laplacian (a.k.a. the Fiedler value or algebraic connectivity). Although the
computational cost of such greedy sampling is prohibitive for large graphs,
it effectively boosts the algebraic connectivity compared to Erdős-Rényi
random graphs.

However, active sampling for data acquisition is not always feasible. For
example, when data is collected from the Internet crowd or from purchasing
preferences, data collection is in general passive and independent. An
important benefit of random sampling over active methods is that data
collection can be trivially parallelized: comparisons can be collected from
independent or weakly dependent processes, each selected from a pre-assigned
block of object pairs. From this viewpoint, the simplicity of random
sampling allows flexibility and applicability to diverse situations, such as
online or distributed ranking, which are often desirable in crowdsourcing
scenarios.

Therefore, our interest in this paper is to investigate the characteristics
of these three sampling methods (i.e., random sampling with/without
replacement and greedy sampling) for HodgeRank and identify an attractive
sampling strategy that is particularly suitable for crowdsourcing
experiments. The natural questions we are trying to address are: (i) which
sampling scheme is the best, e.g., contains the most information for
HodgeRank? and (ii) how do random and greedy sampling schemes compare in
practice?
We approach these problems with a combination of theory and experiment
in this paper. The performance of these sampling schemes is evaluated via
the stability of HodgeRank, as measured by the Fiedler value. The
Erdős-Rényi random graph model is associated with random sampling without
replacement. For this model, an estimate of the Fiedler value was recently
given in [15]. The proof of this estimate hinges on an estimate of [16],
which can be shown to imply that, at first order, the Fiedler value is the
minimal degree of the graph. The minimal degree of the graph can then be
estimated from the binomial distribution. To analyze the random graph model
associated with random sampling with replacement, we generalize the result
given in [16] to multigraphs. A simple Normal approximation is then used to
estimate the Fiedler value. As the graphs become increasingly dense, we
prove that both random sampling methods asymptotically have the same Fiedler
value. Our analysis implies:

i) For a finite graph which is sparse, random sampling with and without
replacement have similar performance; for a dense finite graph,
random sampling without replacement is superior to random sampling
with replacement, and approaches the performance of greedy sampling.

ii) For very large graphs, the three considered sampling schemes exhibit 
similar performance. 

In particular, the asymptotic behavior of the two random sampling schemes
is rigorously proved in Theorem 1 and their discrepancy for small sample
sizes is supported by heuristic estimates (see Sections 3.4 and 3.5). These
analytic conclusions and the performance of the greedy sampling strategy
are supported by both simulated examples and real-world datasets (see
Section 4). Based on our findings, for a relatively small number of items to
be compared, we recommend a two-stage sampling strategy where a greedy
sampling method is used initially and random sampling without replacement
is used in the second stage. When a large number of items is to be compared,
we recommend random sampling with replacement as this is computationally
inexpensive and trivially parallelizable.

Outline. Section 2 contains a review of related work. We then establish
some theoretical results for random sampling methods in Section 3. Proofs
are collected in Appendix A. The results of detailed experiments on
crowdsourced data are reported in Section 4. We conclude in Section 5
with some remarks and a discussion of future work.



2. Related Work 


2.1. Crowdsourcing 

The term “crowdsourcing” is a portmanteau of “crowd” and “outsourcing”.
It is distinguished from outsourcing in that the work comes from an
undefined public rather than being commissioned from a specific, named
group. The benefits of crowdsourcing include time-efficiency and low
monetary costs. Among various crowdsourcing platforms, Amazon’s Mechanical
Turk (MTurk) is probably the most popular and provides a marketplace for
a variety of tasks; anyone seeking help from the Internet crowd can post
their task requests to the website. Another platform, InnoCentive, enables
organizations to engage diverse innovation communities such as employees,
partners, or customers to rapidly generate novel ideas and innovative
solutions to challenging research and development problems. CrowdFlower’s
expertise is in harnessing the Internet crowd to provide a wide range of
enterprise solutions, taking complicated projects and dividing them into
smaller, simpler tasks, which are then completed by individual contributors.
CrowdRank draws on over 3 million community votes already cast and applies
crowdsourcing to rankings via a pairwise ranking methodology that avoids the
tedium of asking community members to rank every item in a category. In
addition, AllOurIdeas provides a free and open-source website that allows
groups all over the world to create and use pairwise wiki surveys.
Respondents can either participate in a pairwise wiki survey or add new
items that are then presented to future respondents.

With the help of these platforms, requesters post tasks (e.g., image
annotation [17, 18], document relevance [19], document evaluation [20],
music emotion recognition [21], affection mining in computer games [22], and
quality of experience evaluation [23, 3]) and users are compensated in the
form of micro-payments for completing these posted tasks. Several studies
have been conducted to evaluate the quality of completed tasks obtained from
crowdsourcing approaches. For example, researchers have investigated the
reliability of non-experts and found that a single expert is, in the
majority of cases, more reliable than a single non-expert. However, an
aggregate of several cheap non-expert judgements can approximate the
performance of expensive expertise [24, 25]. From this point of view,
conducting subjective tests in a crowdsourcing context is a reasonable
strategy.
2.2. Pairwise ranking aggregation

The problem of ranking or rating with paired comparison data has been
widely studied in a variety of fields including decision science [26],
machine learning [27], social choice [28], and statistics [29]. Various
methods have been studied for this problem, including, among others, maximum
likelihood under a Bradley-Terry model, rank centrality (PageRank/MC3)
[30, 31], HodgeRank [4], and a pairwise variant of the Borda count [32, 4].
If we consider the setting where pairwise comparisons are drawn i.i.d. from
some fixed but unknown probability distribution, then under a
“time-reversibility” condition, the rank centrality (PageRank) and HodgeRank
algorithms both converge to an optimal ranking [33]. However, PageRank is
only able to aggregate the pairwise comparisons into a global ranking over
the items. HodgeRank not only provides a means to determine a global ranking
from paired comparison data under various statistical models (e.g., Uniform,
Thurstone-Mosteller, Bradley-Terry, and Angular Transform), but also
measures the inconsistency of the global ranking obtained. In particular, it
takes a graph theoretic view, which maps paired comparison data to edge
flows on a graph, possibly imbalanced (where different pairs may receive
different numbers of comparisons) and incomplete (where every participant
may only provide partial comparisons), and then applies combinatorial Hodge
theory to achieve an orthogonal decomposition of such edge flows into three
components: a gradient flow representing the global rating (optimal in the
$L_2$-norm sense), a triangular curl flow representing local inconsistency,
and a harmonic flow representing global inconsistency. In this paper, we
will analyze two random sampling methods based on the HodgeRank estimate.

2.3. Active sampling 

The fundamental notion of active sampling has a long history in machine
learning. To our knowledge, the first to discuss it explicitly were [34] and
[35]. Subsequently, the term active learning was coined [36] and has been
shown to benefit a number of multimedia applications such as object
categorization [37], image retrieval [38, 39], video classification [40],
dataset annotation [41], and interactive co-segmentation [42], maximizing
the knowledge gain while valuing the user effort [43].

Recently, several authors have studied the active sampling problems for
ranking and rating, with the goal of reducing the amount of data that must
be collected. For example, [44] considers the case when the true scoring
function reflects the Euclidean distance of object covariates from a global
reference point. If objects are embedded in $\mathbb{R}^d$ or the scoring
function is linear in such a space, the active sampling complexity can be
reduced to $O(d \log n)$, as demonstrated through a comparison of beer [45].
Moreover, [46] discusses the application of a polynomial time approximate
solution (PTAS) for the NP-hard minimum feedback arc-set (MFAST) problem in
active ranking, with sample complexity
$O(n \cdot \mathrm{poly}(\log n, 1/\varepsilon))$ to achieve
$\varepsilon$-optimum. The works mentioned above can be treated as “learning
to rank”, which requires a vector representation of the items to be ranked,
and thus cannot be directly applied to crowdsourced ranking. In the
crowdsourcing scenario, the explicit feature representation of items is
unavailable and the goal becomes to learn a single ranking function from the
ranked items using a smaller number of samples selected actively [30]. In
[47], a Bayesian framework is proposed to actively select pairwise
comparison queries based on Bradley-Terry models. Furthermore, [48]
addresses the problem of budget allocation in crowd labeling using a
Bayesian Markov decision process and characterizes the optimal policy using
dynamic programming. Most recently, [7, 8] approach active sampling from a
statistical perspective of maximizing the Fisher information, which they
show to be equivalent to maximizing the Fiedler value of the graph (the
smallest nonzero eigenvalue of the graph Laplacian which arises in
HodgeRank), subject to an integer weight constraint. In this paper, we shall
focus on analyzing the Fiedler value of graphs generated by random sampling
schemes.

3. Analysis of sampling methods 

Statistical preference aggregation or ranking/rating from pairwise
comparison data is a classical problem, which can be traced back to the 18th
century with discussions on voting and social choice. This subject area has
recently undergone rapid growth in various applications due to the
availability of the Internet and the development of crowdsourcing
techniques. In these scenarios, we are typically given pairwise comparison
data on a graph $G = (V, E)$, $Y^{\alpha} : E \to \mathbb{R}$ such that
$Y_{ij}^{\alpha} = -Y_{ji}^{\alpha}$, where $\alpha$ is the index for
multiple comparisons, $Y_{ij}^{\alpha} > 0$ if $\alpha$ prefers $i$ to $j$
and $Y_{ij}^{\alpha} < 0$ otherwise. In the dichotomous choice,
$Y_{ij}^{\alpha}$ can be taken as $\{\pm 1\}$, while multiple choices are
also widely used (e.g., $k$-point Likert scale, $k = 3, 4, 5$).

The general purpose of preference aggregation is to look for a global score
$x : V \to \mathbb{R}$ that solves

$$\min_{x \in \mathbb{R}^{|V|},\; x \perp \mathbf{1}} \;
  \sum_{\alpha,\, \{i,j\} \in E} \omega_{ij}^{\alpha}\,
  L(x_i - x_j, Y_{ij}^{\alpha}), \tag{1}$$

where $L(x, y) : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ is a loss
function, $\omega_{ij}^{\alpha}$ denotes the confidence weight of this
comparison, which is set to 1 in this paper although other choices are
possible, and $x_i$ ($x_j$) represents the global ranking score of item
$i$ ($j$, respectively). For a connected graph $G$, restricting to the
subspace $\{x \in \mathbb{R}^{|V|} : x \perp \mathbf{1}\}$ guarantees a
unique solution in (1). For example,
$L(x, y) = (\mathrm{sign}(x) - y)^2$ leads to the minimum feedback arc-set
(MFAST) problem, which is NP-hard, and for which [46] proposes an active
sampling scheme whose complexity is
$O(n \cdot \mathrm{poly}(\log n, 1/\varepsilon))$ to achieve
$\varepsilon$-optimum. In HodgeRank, one benefits from the use of the square
loss $L(x, y) = (x - y)^2$, which leads to fast algorithms to find the
optimal global ranking $x$, as well as an orthogonal decomposition of the
least square residue into local and global inconsistencies [4].

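To see this, let
$\hat{Y}_{ij} = (\sum_{\alpha} \omega_{ij}^{\alpha} Y_{ij}^{\alpha}) /
(\sum_{\alpha} \omega_{ij}^{\alpha})$ (with
$\omega_{ij} = \sum_{\alpha} \omega_{ij}^{\alpha}$) be the mean pairwise
comparison scores on $(i, j)$, which can be extended to a family of
generalized statistical linear models. To characterize the solution and
residual of (1), we first define a 3-clique complex
$\mathcal{X}_G = (V, E, T)$ where $T$ collects all triangular complete
subgraphs in $G$:

$$T = \left\{ \{i,j,k\} \in \binom{V}{3} :
  \{i,j\}, \{j,k\}, \{k,i\} \in E \right\}.$$

Then every $\hat{Y}$ admits an orthogonal decomposition:

$$\hat{Y} = \hat{Y}^g + \hat{Y}^h + \hat{Y}^c, \tag{2}$$

where the gradient flow $\hat{Y}^g$ satisfies

$$\hat{Y}_{ij}^g = x_i - x_j, \quad \text{for some } x \in \mathbb{R}^n,
  \tag{3}$$

the harmonic flow $\hat{Y}^h$ satisfies

$$\hat{Y}_{ij}^h + \hat{Y}_{jk}^h + \hat{Y}_{ki}^h = 0, \quad
  \text{for each } \{i,j,k\} \in T, \tag{4}$$

$$\sum_{j : (i,j) \in E} \omega_{ij} \hat{Y}_{ij}^h = 0, \quad
  \text{for each } i \in V, \tag{5}$$

and the curl flow $\hat{Y}^c$ satisfies (5) but not (4).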
The residuals $\hat{Y}^c$ and $\hat{Y}^h$ indicate whether inconsistencies
in the ranking data arise locally or globally. Local inconsistency can be
fully characterized by triangular cycles (e.g.,
$i \succ j \succ k \succ i$), while global inconsistency generically
involves longer cycles of inconsistency in $V$ (e.g.,
$i \succ j \succ k \succ \cdots \succ i$), which may arise due to data
incompleteness and cause the fixed tournament issue. Avoiding global
inconsistency with random sampling generally requires at least
$O(n^{3/2})$ random samples without replacement [2, 3].

The global rating score $x$ in (3) can be obtained by solving the normal
equation [4],

$$Lx = b. \tag{6}$$

Here, $L = D - A$ is the unnormalized graph Laplacian, where
$A(i,j) = \omega_{ij}$ if $(i,j) \in E$ and $A(i,j) = 0$ otherwise, $D$ is
a diagonal matrix with $D(i,i) = \sum_{j : (i,j) \in E} \omega_{ij}$, and
$b = \mathrm{div}(\hat{Y})$ is the divergence flow defined by
$b_i = \sum_{j : (i,j) \in E} \omega_{ij} \hat{Y}_{ij}$. There is an
extensive literature in linear algebra on solving the symmetric Laplacian
equation. However, all methods are subject to the intrinsic stability of
HodgeRank, characterized in the following subsection.
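
As an illustration of the normal equation (6), the following minimal Python
sketch assembles $L$ and $b$ from a list of comparisons and returns the
minimal-norm least-squares solution. It assumes unit confidence weights, as
in this paper. The function name `hodgerank`, the `(i, j, y)` input format,
and the use of NumPy's pseudoinverse are choices made for this example and
are not taken from the original implementation.

```python
import numpy as np


def hodgerank(n, comparisons):
    """Least-squares global scores from pairwise comparisons.

    comparisons: iterable of (i, j, y), where y > 0 means i is preferred to j.
    Each comparison has unit weight, so omega_ij counts comparisons on {i, j}.
    """
    L = np.zeros((n, n))  # unnormalized graph Laplacian D - A
    b = np.zeros(n)       # divergence: b_i = sum_j omega_ij * Yhat_ij
    for i, j, y in comparisons:
        L[i, i] += 1.0
        L[j, j] += 1.0
        L[i, j] -= 1.0
        L[j, i] -= 1.0
        b[i] += y         # contribution of Y_ij at node i
        b[j] -= y         # skew-symmetry: Y_ji = -Y_ij at node j
    # Pseudoinverse with an explicit cutoff for the zero eigenvalue of L;
    # the result is orthogonal to the constant vector.
    return np.linalg.pinv(L, 1e-10) @ b


if __name__ == "__main__":
    # Three items with perfectly consistent score differences.
    data = [(0, 1, 0.4), (1, 2, 0.1), (0, 2, 0.5)]
    print(hodgerank(3, data))
```

With perfectly consistent data, as in the usage lines, the residual vanishes
and the returned scores reproduce the score differences exactly, up to a
constant shift that the constraint $x \perp \mathbf{1}$ removes.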


3.1. Stability of HodgeRank 

The following classical result (see, e.g., [49]) gives a measure of the
sensitivity of the global ranking score $x$ against perturbations on $L$ and
$b$. Given the parameterized system

$$(L + \epsilon F)\, x(\epsilon) = b + \epsilon f, \qquad x(0) = x,$$

where $F \in \mathbb{R}^{n \times n}$ and $f \in \mathbb{R}^n$, then

$$\frac{\|x(\epsilon) - x\|}{\|x\|} \le |\epsilon|\, \|L^{-1}\|
  \left( \frac{\|f\|}{\|x\|} + \|F\| \right) + O(\epsilon^2).$$

Here and throughout this paper, the matrix norm is the spectral norm and
the vector norm is the Euclidean norm. In crowdsourcing, the matrix $L$ is
determined by the sampled pairs, and can be regarded as fixed once the
pairwise data is given. However, $b = \mathrm{div}(\hat{Y})$ is random
because of noise possibly induced by crowdsourcing. So there is no
perturbation on $L$, i.e., $F = 0$, and we obtain

$$\frac{\|x(\epsilon) - x\|}{\|x\|} \le |\epsilon|\, \|L^{-1}\|\,
  \frac{\|f\|}{\|x\|} + O(\epsilon^2). \tag{7}$$
If the graph representing the pairwise comparison data is connected, the
graph Laplacian, $L$, has a one-dimensional kernel spanned by the constant
vector. In this case, the solution to (6) is understood in the minimal norm
least-squares sense, i.e., $x = L^{\dagger} b$ where $L^{\dagger}$ is the
Moore-Penrose pseudoinverse of $L$. Hence (7) implies that the sensitivity
of the estimator is controlled by
$\|L^{\dagger}\| = [\lambda_2(L)]^{-1}$, the reciprocal of the second
smallest eigenvalue of the graph Laplacian. $\lambda_2(L)$ is also referred
to as the Fiedler value or algebraic connectivity of the graph. It follows
that collecting pairwise comparison data so that $\lambda_2(L)$ is large
provides an estimator which is insensitive to noise in the pairwise
comparison data, $Y$.
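
The identity $\|L^{\dagger}\| = [\lambda_2(L)]^{-1}$ can be checked
numerically on small graphs. The sketch below is only an illustration: the
helper names and the two toy graphs (a path and a cycle on four vertices)
are assumptions made for this example. It shows that closing the path into a
cycle raises the Fiedler value and lowers the sensitivity bound.

```python
import numpy as np


def laplacian(n, edges):
    """Unnormalized Laplacian of an unweighted graph given as a list of pairs."""
    L = np.zeros((n, n))
    for i, j in edges:
        L[i, i] += 1.0
        L[j, j] += 1.0
        L[i, j] -= 1.0
        L[j, i] -= 1.0
    return L


def fiedler_value(L):
    """Second-smallest eigenvalue of a symmetric Laplacian."""
    return np.linalg.eigvalsh(L)[1]  # eigvalsh returns ascending eigenvalues


path = laplacian(4, [(0, 1), (1, 2), (2, 3)])
cycle = laplacian(4, [(0, 1), (1, 2), (2, 3), (3, 0)])

for name, L in (("path", path), ("cycle", cycle)):
    lam2 = fiedler_value(L)
    # Spectral norm of the pseudoinverse, with an explicit cutoff for the
    # zero eigenvalue associated with the constant vector.
    sensitivity = np.linalg.norm(np.linalg.pinv(L, 1e-10), 2)
    print(name, "lambda_2 =", lam2, "||pinv(L)|| =", sensitivity)
```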

Remark. [7, 8] show that for a fixed variance statistical error model, the
Fisher information matrix of the HodgeRank estimator (6) is proportional to
the graph Laplacian, $L$. Thus, finding a graph with large algebraic
connectivity, $\lambda_2(L)$, can be equivalently viewed in the context of
optimal experimental design as maximizing the “E-criterion” of the Fisher
information.

3.2. Random sampling schemes 

In what follows, we study two random sampling schemes: 

1. $G_0(n,m)$: Uniform sampling with replacement. Each edge is sampled
from the uniform distribution on $\binom{n}{2}$ edges, with replacement.
This is a weighted graph and the sum of weights is $m$.

2. $G(n,m)$: Uniform sampling without replacement. Each edge is sampled
from the uniform distribution on the available edges without replacement.
For $m < \binom{n}{2}$, this is an instance of the Erdős-Rényi random
graph model $G(n,p)$ with $p = m / \binom{n}{2}$.
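
The two schemes can be sketched in a few lines of Python. This is a minimal
illustration using only the standard library; the function names and the
representation of a sample as a `Counter` mapping each pair to its
multiplicity are assumptions made for this example.

```python
import itertools
import random
from collections import Counter


def sample_with_replacement(n, m, rng):
    """G_0(n, m): m i.i.d. uniform draws from the C(n, 2) pairs.

    Returns a Counter mapping each sampled pair to its multiplicity (weight).
    """
    pairs = list(itertools.combinations(range(n), 2))
    return Counter(rng.choice(pairs) for _ in range(m))


def sample_without_replacement(n, m, rng):
    """G(n, m): m distinct pairs chosen uniformly, so every weight is 1."""
    pairs = list(itertools.combinations(range(n), 2))
    return Counter(rng.sample(pairs, m))


if __name__ == "__main__":
    rng = random.Random(0)
    g0 = sample_with_replacement(16, 60, rng)
    g = sample_without_replacement(16, 60, rng)
    print("distinct pairs with replacement:   ", len(g0))
    print("distinct pairs without replacement:", len(g))
```

In both cases the weights sum to $m$; sampling with replacement typically
spreads that weight over fewer distinct pairs, which is the source of the
finite-size gap analyzed below.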

Motivated by the estimate given in (7), we will characterize the behavior
of the Fiedler value of the graph Laplacians associated with these sampling
methods. It is well-known that the Erdős-Rényi random graph $G(n,p)$ is
connected with high probability if the sampling rate is at least
$p = (1 + \epsilon) \log n / n$ [13]. Therefore we use the parameter
$p_0 := 2m / ((n - 1) \log n) > 1$ (where
$m = n(n-1)p/2 \sim n^2 p / 2$), the degree above the connectivity
threshold, to compare the efficiency in boosting Fiedler values for
different sampling methods.

As a comparison for random sampling schemes, we consider a greedy sampling
method of sampling pairwise comparisons to maximize the algebraic
connectivity of the graph [50, 7, 8]. The problem of finding a set of $m$
edges on $n$ vertices with maximal algebraic connectivity is an NP-hard
problem. The following greedy heuristic, based on the Fiedler vector,
$\psi$, can be used. The Fiedler vector is the eigenvector of the graph
Laplacian corresponding to the Fiedler value. We shall denote the graph
with $m$ edges on $n$ vertices by $G^*(n,m)$. The graph is built
iteratively: at each iteration, the Fiedler vector is computed and the edge
$(i,j)$ which maximizes $(\psi_i - \psi_j)^2$ is added to the graph. The
iterations are repeated until a graph of the desired size is obtained.
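
The greedy heuristic can be sketched as follows. This is a minimal
illustration, not the implementation of [7, 8]: the function names are
hypothetical, and two details are assumptions of this sketch. First, only
pairs that are not yet in the graph are considered as candidates. Second,
the Fiedler vector is computed after shifting the constant direction, so
that it is well defined (orthogonal to the constant vector) even while the
graph is still disconnected.

```python
import itertools
import numpy as np


def fiedler_pair(L):
    """Smallest eigenpair of L restricted to vectors orthogonal to 1."""
    n = L.shape[0]
    # Move the constant direction above every other eigenvalue of L, so the
    # smallest eigenpair of M lies in the subspace orthogonal to 1.
    M = L + (np.trace(L) + 1.0) * np.ones((n, n)) / n
    vals, vecs = np.linalg.eigh(M)  # ascending eigenvalues
    return vals[0], vecs[:, 0]


def greedy_graph(n, m):
    """Add m distinct unit-weight edges, each maximizing (psi_i - psi_j)^2."""
    candidates = set(itertools.combinations(range(n), 2))
    if m > len(candidates):
        raise ValueError("m exceeds the number of distinct pairs")
    L = np.zeros((n, n))
    edges = []
    for _ in range(m):
        _, psi = fiedler_pair(L)
        # sorted() makes tie-breaking deterministic.
        i, j = max(sorted(candidates),
                   key=lambda e: (psi[e[0]] - psi[e[1]]) ** 2)
        candidates.remove((i, j))
        edges.append((i, j))
        L[i, i] += 1.0
        L[j, j] += 1.0
        L[i, j] -= 1.0
        L[j, i] -= 1.0
    return edges, L


if __name__ == "__main__":
    edges, L = greedy_graph(8, 12)
    print(edges)
    print("Fiedler value:", fiedler_pair(L)[0])
```

Each iteration requires a full eigendecomposition, which illustrates why
this strategy becomes expensive for large graphs.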


3.3. Fiedler value and minimal degree 

The key to evaluating the Fiedler value of random graphs is via the graph
minimal degree, $d_{\min}$. This is due to the definition of the graph
Laplacian,

$$L = D - A,$$

whose diagonal $D(i,i) \sim O(m/n)$ dominates, as
$\max_{\|v\| = 1} v^T A v \sim O(\sqrt{m/n})$. The following lemma makes
this observation precise; it is used by [15] in the study of Erdős-Rényi
random graphs.

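Lemma 1. Consider the random graph $G_0(n,m)$ (or $G(n,m)$) and let
$\lambda_2$ be the Fiedler value of the graph. Suppose there exists a
$p_0 > 1$ so that $2m \ge p_0 n \log n$ and $C, c_1 > 0$ so that

$$\left| d_{\min} - c_1 \frac{2m}{n} \right| \le C \sqrt{\frac{2m}{n}}$$

with probability at least
$1 - O\!\left(e^{-\Omega(\sqrt{2m/n})}\right)$. Then there exists a
$\tilde{C} > 0$ so that

$$\left| \lambda_2 - c_1 \frac{2m}{n} \right| \le
  \tilde{C} \sqrt{\frac{2m}{n}}.$$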

Lemma 1 implies that the difference between $\lambda_2$ and $d_{\min}$
(i.e., the minimal degree) is small, so the Fiedler value for both random
graphs can be approximated by their minimal degrees.

The proof for $G(n,m)$ follows from [15], which establishes the result for
the Erdős-Rényi random model $G(n,p)$. The proof for $G_0(n,m)$ needs the
following lemma.


Lemma 2. Let $A$ denote the adjacency matrix of a random graph from
$G_0(n,m)$ and $S = \{v \perp \mathbf{1} : \|v\| = 1\}$. There exists a
constant $c > 0$, such that if $m > n \log n / 2$, the estimate

$$\max_{v \in S} v^T A v \le c \sqrt{2m/n}$$

holds with probability at least $1 - O(1/n)$.

With the aid of this lemma, one can estimate the Fiedler value by the
minimal degree of $G_0(n,m)$. In fact,

$$\lambda_2(L) = \min_{v \in S} \langle v, Lv \rangle
  = \min_{v \in S} \left( \langle v, Dv \rangle - \langle v, Av \rangle
    \right)
  \ge d_{\min} - \max_{v \in S} \langle v, Av \rangle.$$

Also, Cheeger’s inequality tells us that
$\lambda_2(L) \le \frac{n}{n-1} d_{\min}$. These bounds show the validity
of Lemma 1. The proof of Lemma 2 is given in Appendix A.
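
The two-sided bound
$d_{\min} - \max_{v \in S} \langle v, Av \rangle \le \lambda_2(L) \le
\frac{n}{n-1} d_{\min}$ holds for every graph and can be checked numerically
on one draw from $G_0(n,m)$. The sketch below is illustrative only; the
function name, the seed, and the values of $n$ and $m$ are arbitrary choices
for this example.

```python
import numpy as np


def random_multigraph_laplacian(n, m, rng):
    """Laplacian of one draw from G_0(n, m): m uniform pairs with replacement."""
    L = np.zeros((n, n))
    for _ in range(m):
        # Two distinct vertices chosen uniformly give a uniform unordered pair.
        i, j = rng.choice(n, size=2, replace=False)
        L[i, i] += 1.0
        L[j, j] += 1.0
        L[i, j] -= 1.0
        L[j, i] -= 1.0
    return L


rng = np.random.default_rng(0)
n, m = 64, 600
L = random_multigraph_laplacian(n, m, rng)

degrees = np.diag(L)
A = np.diag(degrees) - L          # weighted adjacency matrix
d_min = degrees.min()
lam2 = np.linalg.eigvalsh(L)[1]   # Fiedler value

# Largest value of <v, Av> over unit vectors orthogonal to 1, obtained from
# the projected matrix P A P. The projection adds a zero eigenvalue, so this
# is max(true maximum, 0), which still yields a valid lower bound.
P = np.eye(n) - np.ones((n, n)) / n
max_quad = np.linalg.eigvalsh(P @ A @ P)[-1]

lower = d_min - max_quad
upper = n / (n - 1) * d_min
print("lower bound:", lower, "lambda_2:", lam2, "upper bound:", upper)
```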


3.4. A heuristic estimate of the minimal degree

In this section, we estimate the minimal degree. First, consider the
Erdős-Rényi random graph model $G(n,p)$ with $p = m / \binom{n}{2}$. Then
$d_i \sim B(n,p)$, so

$$\frac{d_i - np}{\sqrt{np(1-p)}} \approx N(0,1).$$

The degrees are weakly dependent. If the degrees were independent, the
following concentration inequality for Gaussian random variables,

$$\mathrm{Prob}\left( \max_{1 \le i \le n} |X_i| > t \right) \le
  n \exp\left( -\frac{t^2}{2} \right), \qquad X \sim N(0, I_n),$$

would imply that the minimal value of $n$ copies of $N(0,1)$ is about
$-\sqrt{2 \log n}$. In this case,

$$d_{\min} \approx np - \sqrt{2 \log(n)\, np (1-p)},$$

implying that

$$\frac{d_{\min}}{np} \approx 1 - \sqrt{\frac{2 \log n}{np}}\,
  \sqrt{1-p}.$$

A similar approximation can be employed for $G_0(n,m)$. Here,
$d_i \sim B(m, 2/n)$, so

$$\frac{d_i - np}{\sqrt{np(1 - 2/n)}} \approx N(0,1).$$

Again, the $d_i$ are only weakly dependent, so

$$d_{\min} \approx np - \sqrt{2 \log(n)\, np (1 - 2/n)},$$

which implies that

$$\frac{d_{\min}}{np} \approx 1 - \sqrt{\frac{2 \log n}{np}}\,
  \sqrt{1 - 2/n}.$$
Collecting these results and using

$$\frac{d_{\min}}{np} \approx \frac{\lambda_2}{np},$$

we have the following estimates.

Key Estimates.

$$G_0(n,m): \quad \frac{\lambda_2}{np} \approx a_1(p_0, n) :=
  1 - \sqrt{\frac{2}{p_0}}\, \sqrt{1 - \frac{2}{n}}, \tag{8}$$

$$G(n,m): \quad \frac{\lambda_2}{np} \approx a_2(p_0, n) :=
  1 - \sqrt{\frac{2}{p_0}}\, \sqrt{1 - p}, \tag{9}$$

where $p = \frac{p_0 \log n}{n}$.

Remark. As $n \to \infty$, both (8) and (9) become
$\frac{\lambda_2}{np} \approx 1 - \sqrt{2/p_0}$. But for finite $n$ and
dense $p$, $G(n,m)$ may have a larger Fiedler value than $G_0(n,m)$.

The above reasoning (falsely) assumes independence of the $d_i$, which is
only valid as $n \to \infty$. In the following section, we make this precise
with an asymptotic estimate of the Fiedler value in the two random sampling
schemes.
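
The heuristic estimates (8) and (9) are simple closed forms and can be
tabulated directly. The short sketch below uses only the standard library;
the function names `a1` and `a2` mirror the notation $a_1$, $a_2$, and the
chosen values of $n$ and $p_0$ are arbitrary examples (with $p_0$ kept small
enough that $p = p_0 \log n / n \le 1$).

```python
import math


def a1(p0, n):
    """Estimate (8) of lambda_2 / (np) for G_0(n, m), with replacement."""
    return 1.0 - math.sqrt(2.0 / p0) * math.sqrt(1.0 - 2.0 / n)


def a2(p0, n):
    """Estimate (9) of lambda_2 / (np) for G(n, m), without replacement."""
    p = p0 * math.log(n) / n  # edge probability; must not exceed 1
    return 1.0 - math.sqrt(2.0 / p0) * math.sqrt(1.0 - p)


if __name__ == "__main__":
    n = 64
    for p0 in (2.0, 8.0, 15.0):
        print(f"p0 = {p0:5.1f}  a1 = {a1(p0, n):.3f}  a2 = {a2(p0, n):.3f}")
```

Since $p \ge 2/n$ whenever $p_0 \log n \ge 2$, estimate (9) is at least as
large as estimate (8) in the regime considered here, and the gap widens as
the graph becomes dense.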

3.5. Asymptotic analysis of the Fiedler value 

In the last section, we gave a heuristic estimate of the Fiedler value.
The following theorem gives an asymptotic estimate of the Fiedler value as
$n \to \infty$.

Theorem 1. Consider a random graph $G_0(n,m)$ (or $G(n,m)$) on $n$
vertices corresponding to uniform sampling with (without) replacement and
$m = p_0 n \log(n) / 2$. Let $\lambda_2$ be the Fiedler value of the graph.
Then

$$\frac{\lambda_2}{2m/n} = a(p_0) +
  O\!\left( \frac{1}{\sqrt{2m/n}} \right) \tag{10}$$

with high probability, where $a(p_0) \in (0,1)$ denotes the solution to

$$p_0 - 1 = a p_0 (1 - \log a).$$

The proof of Theorem 1 is given in Appendix A.

Remark. For $p_0 \gg 1$, $a(p_0) = 1 - \sqrt{2/p_0} + O(1/p_0)$ [15]. Thus,
Theorem 1 implies that the two sampling methods have the same asymptotic
algebraic connectivity, $\frac{\lambda_2}{2m/n} \sim 1 - \sqrt{2/p_0}$ as
$n \to \infty$ and $p_0 \gg 1$. Note from (8) and (9) that
$\lim_{n \to \infty} a_1(p_0, n) = \lim_{n \to \infty} a_2(p_0, n) =
1 - \sqrt{2/p_0}$.
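
The constant $a(p_0)$ in Theorem 1 has no closed form, but it is easy to
compute numerically: on $(0,1)$ the map $a \mapsto a p_0 (1 - \log a)$ is
increasing (its derivative is $-p_0 \log a > 0$), so the defining equation
has a unique root that bisection will find. The sketch below is an
illustration using only the standard library; the function name `a_of_p0`
and the tolerance are assumptions of this example.

```python
import math


def a_of_p0(p0, tol=1e-12):
    """Root a in (0, 1) of p0 - 1 = a * p0 * (1 - log a), by bisection."""
    if p0 <= 1.0:
        raise ValueError("p0 must exceed 1")

    def g(a):
        return a * p0 * (1.0 - math.log(a)) - (p0 - 1.0)

    lo, hi = 1e-300, 1.0  # g(lo) < 0 and g(hi) = 1 > 0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    for p0 in (2.0, 4.0, 16.0, 256.0):
        approx = 1.0 - math.sqrt(2.0 / p0)
        print(f"p0 = {p0:6.1f}  a(p0) = {a_of_p0(p0):.4f}  "
              f"1 - sqrt(2/p0) = {approx:.4f}")
```

Comparing the two printed columns shows how the large-$p_0$ approximation
$1 - \sqrt{2/p_0}$ degrades near the connectivity threshold $p_0 = 1$.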

The main difference between $G(n,p)$ and $G_0(n,m)$ is the weak dependence
pattern; the dependence of $d_i$ and $d_j$ only occurs on edge $(i,j)$,
which appears at most once for $G(n,p)$, whereas all of the $m$ edges can be
$(i,j)$ for $G_0(n,m)$. However, $d_i$ and $d_j$ are still almost
independent when $n$ is sufficiently large, so the heuristic estimator using
an i.i.d. Normal distribution as an approximation is not unreasonable.

Figure 1: A comparison of the Fiedler value, minimal degree, and estimates
(8), (9), and (10) for graphs generated via random sampling with/without
replacement and greedy sampling for $n = 64$.

Theorem 1 is supported by Figures 1 and 2, where the Fiedler value,
minimal degree, and various estimates, (8), (9), and (10), are plotted for
varying edge sparsity, $p_0$. For $G_0(n,m)$, we observe that $a(p_0)$ fits
$d_{\min}$ and $\lambda_2$ well for all $p_0$. For $G(n,m)$, we observe
that $a(p_0)$ fits $d_{\min}$ and $\lambda_2$ well when $p_0$ is small, but
when $p_0$ is large, the estimate given in (9) is more reliable. In all
cases, the Fiedler value for the graph $G^*$ generated by greedy sampling
is larger than that for randomly sampled graphs.

4. Experiments 

In this section, we study three examples with both simulated and real-world
data to illustrate the validity of the analysis above and applications
of the proposed sampling schemes. The first example is with simulated data
while the latter two consider real-world data from QoE evaluation. The code
for the numerical experiment and the real-world datasets can be downloaded
from https://code.google.com/p/active-random-joint-sampling/.
Figure 2: Algebraic connectivity and minimal degree: random sampling with
replacement vs. random sampling without replacement for (a) $n = 2^5$,
(b) $n = 2^7$, (c) $n = 2^9$, and (d) $n = 2^{11}$. The gaps among these
sampling schemes vanish as $n \to \infty$.
4.1. Simulated data

This subsection uses simulated data to illustrate the performance
differences among the three sampling schemes. We randomly create a global
ranking score as the ground-truth, uniformly distributed on $[0,1]$ for
$|V| = n$ candidates. In this way, we obtain a complete graph with
$\binom{n}{2}$ edges consistent with the true preference direction. We
sample pairs from this complete graph using random sampling with/without
replacement and the greedy sampling scheme. The experiments are repeated
1000 times and ensemble statistics for the HodgeRank estimator (6) are
recorded. As we know the ground-truth score, the metric used here is the
$L_2$-distance between the HodgeRank estimate and the ground-truth score,
$\|x - x^*\|$.

Figure 3 (a) shows the mean $L_2$-distance and standard deviation
associated with the three sampling schemes for $n = 16$ (chosen to be
consistent with the two real-world datasets considered later). The x-axes of
the graphs are the number of edges, as measured by $p_0$ (defined in Section
3.2), taken to be greater than one so that the graph is connected with high
probability. From these experimental results, we observe that random
sampling without replacement performs better than random sampling with
replacement in all cases, with smaller $L_2$-distance and smaller standard
deviation. As $p_0$ grows, the performance of the two random sampling
schemes diverges. When the graph is sparse, the greedy sampling scheme shows
better performance than random sampling with/without replacement. However,
when the graphs become dense, random sampling without replacement performs
qualitatively similarly to greedy sampling.

To simulate real-world data contaminated by outliers, each binary
comparison is independently flipped with a fixed probability, referred to as
the outlier percentage (OP). For n = 16, with OP = 10% and 30%, we plot the
number of sampled pairs against the $L_2$-distance and standard deviation
between the ground-truth and HodgeRank estimate in Figure 3 (b,c). As in the
non-contaminated case, the greedy sampling strategy outperforms the random
sampling strategy. As OP increases, the performance gap among the three
sampling schemes decreases.
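
The simulation protocol described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' code: the estimator (6) is taken to be an unweighted least-squares fit on +1/-1 comparison labels, the helper names (`sample_pairs`, `solve`, `hodgerank`, `l2_error`) are hypothetical, greedy sampling is omitted, and the sampled graph is assumed to be connected.

```python
import itertools
import math
import random

def sample_pairs(n, m, with_replacement, rng):
    """Draw m edges uniformly from the complete graph on n vertices."""
    edges = list(itertools.combinations(range(n), 2))
    if with_replacement:
        return [rng.choice(edges) for _ in range(m)]
    return rng.sample(edges, m)

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x

def hodgerank(n, pairs, y):
    """Mean-zero minimizer of sum_k (x_i - x_j - y_k)^2 (assumed form of the estimator)."""
    # Normal equations L x = b; adding 1/n to every entry pins the mean to zero.
    # Assumes the sampled graph is connected, otherwise the system is singular.
    lap = [[1.0 / n] * n for _ in range(n)]
    rhs = [0.0] * n
    for (i, j), yk in zip(pairs, y):
        lap[i][i] += 1.0
        lap[j][j] += 1.0
        lap[i][j] -= 1.0
        lap[j][i] -= 1.0
        rhs[i] += yk
        rhs[j] -= yk
    return solve(lap, rhs)

def l2_error(n, m, with_replacement, outlier_prob, rng):
    """One trial: L2 distance between the estimate and the centered ground truth."""
    truth = [rng.random() for _ in range(n)]
    pairs = sample_pairs(n, m, with_replacement, rng)
    y = []
    for i, j in pairs:
        label = 1.0 if truth[i] > truth[j] else -1.0
        if rng.random() < outlier_prob:  # flip the comparison (outlier)
            label = -label
        y.append(label)
    est = hodgerank(n, pairs, y)
    mean = sum(truth) / n
    return math.sqrt(sum((e - (t - mean)) ** 2 for e, t in zip(est, truth)))
```

Averaging `l2_error` over repeated trials for a range of m gives curves of the kind discussed in this subsection. With +1/-1 labels the fitted scores live on a different scale from the [0,1] ground truth, so absolute distances depend on this modelling choice.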

4.2. Real-world data

The second example compares the three sampling methods on a video quality
assessment dataset [2]. It contains 38,400 paired comparisons of the LIVE
dataset [51] from 209 random observers. An attractive property of this
dataset is that the paired comparison data is complete and balanced.

Figure 3: The $L_2$-distance and standard deviation between ground-truth and HodgeRank
estimate for random sampling with/without replacement and greedy sampling for n = 16.
Panels: (a) OP = 0%; (b) OP = 10%; (c) OP = 30%.


As LIVE includes 10 different reference videos and 15 distorted versions of
each reference (obtained using four different distortion processes: MPEG-2
compression, H.264 compression, lossy transmission of H.264 compressed
bitstreams through simulated IP networks, and lossy transmission of H.264
compressed bitstreams through simulated wireless networks), for a total of
160 videos, the complete comparison of this video database requires
$10 \times \binom{16}{2} = 1200$ comparisons. Therefore, 38,400 comparisons
correspond to 32 complete rounds.
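
The counting above can be checked directly. The sketch below assumes, as the totals imply, that each reference group holds 16 videos (1 reference plus 15 distorted versions) and that comparisons are made only within a group; the variable names are illustrative.

```python
import math

n_references = 10
videos_per_group = 1 + 15           # reference plus its distorted versions
total_comparisons_collected = 38400

# Every unordered pair within a group is compared once per complete round.
pairs_per_group = math.comb(videos_per_group, 2)
pairs_per_round = n_references * pairs_per_group
complete_rounds = total_comparisons_collected // pairs_per_round
```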

As no ground-truth scores are available, results obtained from all the
paired comparisons are treated as the ground-truth. To ensure statistical
stability, for each of the 10 reference videos, we sample using each of the
three methods 100 times. Figure 4 shows the experimental results for the 10
reference videos in the LIVE database [51]. Similar observations hold on all
of these large scale data collections. Consistent with the simulated data,
when the graph is sparse, greedy sampling performs better than both random
sampling schemes; as the number of samples increases, random sampling
without replacement exhibits performance similar to that of greedy sampling
in the prediction of global ranking scores.

Figure 4: Random sampling with/without replacement vs. greedy sampling for 10
reference videos in LIVE database [51].

The third example shows the sampling results on an imbalanced dataset
for image quality assessment, which contains 15 reference images and 15
distorted versions of each reference, for a total of 240 images which come
from two publicly available datasets, LIVE [51] and IVC [52]. The distorted
images in the LIVE dataset [51] are obtained using five different distortion
processes: JPEG2000, JPEG, White Noise, Gaussian Blur, and Fast Fading
Rayleigh, while the distorted images in the IVC dataset [52] are derived from
four distortion types: JPEG2000, JPEG, LAR Coding, and Blurring. In total,
328 observers, each of whom performs a varied number of comparisons via
the Internet, provide 43,266 paired comparisons. Since the number of paired
comparisons in this dataset is relatively large, all 15 paired comparison
graphs are complete, though possibly imbalanced. This makes it possible for
us to obtain comparable results for the three sampling schemes. As in the
second example, quality scores obtained from all the 43,266 paired
comparisons are treated as the ground-truth. Figure 5 shows the mean
$L_2$-distance over 100 runs on the LIVE [51] and IVC [52] databases; the
results for all reference images agree well with the theoretical and
simulated results presented above.

Figure 5: Random sampling with/without replacement vs. greedy sampling for 15
reference images in LIVE [51] and IVC [52] databases.

4.3. Discussion

In terms of the stability of HodgeRank, random sampling without replacement
exhibits a performance curve between that of the greedy sampling scheme
proposed by [7, 8] and that of random sampling with replacement. When the
sampled graph is sparse, greedy sampling dominates; as the sample size
increases, random sampling without replacement becomes indistinguishable
from greedy sampling, and both dominate random sampling with replacement
(the simplest I.I.D. sampling).

Therefore, in practical situations, we suggest adopting greedy sampling in
the initial stage, which leads to a graph with a large Fiedler value, and
then switching to random sampling without replacement to approximate the
results of greedy sampling. The transition point should depend on the graph
vertex set size, n, for example $p_0 \approx 3$ for n = 16 in our simulated
and real-world examples. This random sampling scheme is simpler and more
flexible than greedy sampling and does not significantly reduce the accuracy
of the HodgeRank estimate.
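
Since the discussion measures stability through the Fiedler value, a small sketch of how that quantity can be estimated for a sampled comparison graph may help. The assumptions here are loud ones: the unnormalized graph Laplacian is used (the paper's exact normalization is not restated in this section), repeated pairs count as multi-edges, the eigenvalue is approximated by power iteration rather than an exact eigensolver, and all function names are hypothetical.

```python
import itertools
import math
import random

def sample_pairs(n, m, with_replacement, rng):
    """Draw m edges uniformly from the complete graph on n vertices."""
    edges = list(itertools.combinations(range(n), 2))
    if with_replacement:
        return [rng.choice(edges) for _ in range(m)]
    return rng.sample(edges, m)

def laplacian(n, pairs):
    """Unnormalized Laplacian; repeated pairs are counted as multi-edges."""
    lap = [[0.0] * n for _ in range(n)]
    for i, j in pairs:
        lap[i][i] += 1.0
        lap[j][j] += 1.0
        lap[i][j] -= 1.0
        lap[j][i] -= 1.0
    return lap

def matvec(mat, v):
    return [sum(a * b for a, b in zip(row, v)) for row in mat]

def center(v):
    mean = sum(v) / len(v)
    return [x - mean for x in v]

def fiedler_value(lap, iters=2000, seed=0):
    """Approximate the second-smallest Laplacian eigenvalue.

    Power iteration on c*I - L restricted to vectors orthogonal to the
    all-ones vector; c exceeds the largest eigenvalue by Gershgorin's bound.
    """
    n = len(lap)
    c = 2.0 * max(lap[i][i] for i in range(n)) + 1.0
    rng = random.Random(seed)
    v = center([rng.gauss(0.0, 1.0) for _ in range(n)])
    for _ in range(iters):
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        lv = matvec(lap, v)
        v = center([c * x - y for x, y in zip(v, lv)])
    lv = matvec(lap, v)
    # Rayleigh quotient of the converged direction.
    return sum(x * y for x, y in zip(v, lv)) / sum(x * x for x in v)
```

Calling `fiedler_value(laplacian(n, sample_pairs(n, m, flag, rng)))` for both values of `flag` and averaging over trials gives a rough comparison of the two random schemes at a given m. The result is an upper approximation when the second and third eigenvalues are close, so more iterations may be needed for sparse graphs.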

5. Conclusion 

This paper analyzed two simple random sampling schemes for the HodgeRank
estimate: random sampling with replacement and random sampling without
replacement. We showed that for a finite graph, when it is sparse, random
sampling without replacement approaches its performance lower bound, random
sampling with replacement; when it is dense, random sampling without
replacement approaches its performance upper bound, greedy sampling. For
large graphs, such performance gaps vanish, in that all three sampling
schemes exhibit similar performance.

Because random sampling relies only on a random subset of pairwise
comparisons, data collection can be trivially parallelized. This simple
structure makes it easy to adapt to new situations, such as online or
distributed ranking. Based on these observations, we suggest that
applications first adopt the greedy sampling method in the initial stage and
random sampling without replacement in the second stage. For very large
graphs, random sampling with replacement may become the best choice: it is
the simplest I.I.D. sampling, and as n goes to infinity the gaps among these
sampling schemes vanish. These sampling schemes enable us to derive reliable
global ratings in an efficient manner, and hence provide a helpful tool for
those who exploit crowdsourceable paired comparison data for subjective
studies.
Appendix A. Proof of Theorem 1

The following basic inequality, which can be found in [53], is used
extensively throughout the proofs.

Lemma 3. (Chernoff-Hoeffding theorem) Assume that $X_i \in [0,1]$, $i = 1, \ldots, n$,
are independent and $E X_i = \mu$. For $\epsilon > 0$, the following inequalities hold:

$$P(\bar{X}_n < \mu - \epsilon) \le e^{-n KL(\mu - \epsilon \| \mu)},$$

$$P(\bar{X}_n > \mu + \epsilon) \le e^{-n KL(\mu + \epsilon \| \mu)},$$

where $KL(p \| q) = p \log(p/q) + (1 - p) \log((1 - p)/(1 - q))$ is the Kullback-Leibler
divergence, and $\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i$ is the sample mean.
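
As a sanity check on the Chernoff-Hoeffding bound, the sketch below compares it with exact binomial tails in the special case of Bernoulli variables. The function names and the chosen parameters are illustrative assumptions; the lemma itself covers any independent variables in [0,1].

```python
import math

def kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), with 0*log(0) = 0."""
    total = 0.0
    if p > 0.0:
        total += p * math.log(p / q)
    if p < 1.0:
        total += (1.0 - p) * math.log((1.0 - p) / (1.0 - q))
    return total

def binom_pmf(n, mu, k):
    return math.comb(n, k) * mu ** k * (1.0 - mu) ** (n - k)

def upper_tail(n, mu, count):
    """Exact P(sum >= count) for n independent Bernoulli(mu) draws."""
    return sum(binom_pmf(n, mu, k) for k in range(count, n + 1))

def lower_tail(n, mu, count):
    """Exact P(sum <= count) for n independent Bernoulli(mu) draws."""
    return sum(binom_pmf(n, mu, k) for k in range(0, count + 1))

def chernoff_bound(n, mu, threshold_mean):
    """exp(-n * KL(threshold_mean || mu)), the bound stated in the lemma."""
    return math.exp(-n * kl(threshold_mean, mu))
```

For example, with n = 50 and mean 0.3, `upper_tail(50, 0.3, 25)` can be compared against `chernoff_bound(50, 0.3, 0.5)`, and `lower_tail(50, 0.3, 5)` against `chernoff_bound(50, 0.3, 0.1)`; in both cases the exact tail should not exceed the bound.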

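Corollary 1.

$$P(\bar{X}_n < k\mu) \le e^{-n\mu(k \log(k) - k + 1)}, \quad k < 1,$$
$$P(\bar{X}_n > k\mu) \le e^{-n\mu(k \log(k) - k + 1)}, \quad k > 1.$$

Proof. $KL(k\mu \| \mu) = k\mu \log(k) + (1 - k\mu) \log\left(\frac{1 - k\mu}{1 - \mu}\right)$. Defining
$f(\mu) := (1 - k\mu) \log\left(\frac{1 - k\mu}{1 - \mu}\right) + (k - 1)\mu$, we compute

$$f(0) = 0, \quad f'(0) = 0, \quad f''(\mu) = \frac{(k - 1)^2}{(1 - k\mu)(1 - \mu)^2} \ge 0.$$

For all $\mu \in (0, \min(1, 1/k))$, we have that

$$f(\mu) \ge 0 \iff e^{-n(1 - k\mu) \log\left(\frac{1 - k\mu}{1 - \mu}\right)} \le e^{n\mu(k - 1)}.$$

The result then follows from Lemma 3. □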
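
The last step of this proof can be spelled out as a short chain. This is a sketch of the algebra only, writing $\mu$ for the common mean $E X_i$ and using the definition of $KL$ from Lemma 3 together with the inequality $f(\mu) \ge 0$:

```latex
\begin{align*}
e^{-n\,KL(k\mu \,\|\, \mu)}
  &= e^{-nk\mu\log k}\; e^{-n(1-k\mu)\log\frac{1-k\mu}{1-\mu}} \\
  &\le e^{-nk\mu\log k}\; e^{n\mu(k-1)} \\
  &= e^{-n\mu\,(k\log k - k + 1)} .
\end{align*}
```

Applying Lemma 3 with threshold $k\mu$ (that is, $\epsilon = |k - 1|\mu$) then gives both tail bounds.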

Throughout this section, $E[\cdot]$ is used for expectation of random variables.
Next, we prove Lemma 2. 

Proof of Lemma 2. Our proof essentially follows [16] for the Erdős-Rényi
random graph $G(n,p)$. For $G_0(n,m)$, consider $A = \sum_{k=1}^{m} A_k$, where $A_k$ is
the adjacency matrix of the $k$-th I.I.D. edge sample $(i_k, j_k)$. Hence

$$v^T A v = \sum_{k=1}^{m} v^T A_k v = \sum_{k=1}^{m} 2 v_{i_k} v_{j_k}$$

is the sum of I.I.D. variables. Let $d := 2m/n$ denote the expected degree of
each graph vertex.
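
The decomposition of the quadratic form into per-edge terms, and the value of the mean degree, can be checked numerically. The sketch below samples edges with replacement as in $G_0(n,m)$; the function name and the test vector are illustrative assumptions.

```python
import itertools
import random

def quadratic_form_check(n, m, seed=0):
    """Return (v^T A v, sum_k 2 v_{i_k} v_{j_k}, mean degree) for one sampled multigraph."""
    rng = random.Random(seed)
    edges = list(itertools.combinations(range(n), 2))
    samples = [rng.choice(edges) for _ in range(m)]  # I.I.D. edges, with replacement
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]

    # A is the sum of the single-edge adjacency matrices A_k.
    adj = [[0.0] * n for _ in range(n)]
    for i, j in samples:
        adj[i][j] += 1.0
        adj[j][i] += 1.0

    quad = sum(v[i] * adj[i][j] * v[j] for i in range(n) for j in range(n))
    edge_sum = sum(2.0 * v[i] * v[j] for i, j in samples)
    mean_degree = sum(sum(row) for row in adj) / n
    return quad, edge_sum, mean_degree
```

The first two returned values agree up to rounding, and the mean degree equals 2m/n exactly because every sampled edge contributes to two vertex degrees.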

The proof strategy is as follows. To reach a bound for $\max_{v \in S} v^T A v$,
one needs a discrete cover $T$ of the set $S$. We turn to an upper bound
$\max_{u,v \in T} u^T A v \le c(1 - \epsilon)^2 \sqrt{2m/n}$. However, any cover $T$ has size
$e^{O(n)}$ and therefore directly using Bernstein's inequality and the union
bound doesn't work. Following [16], we divide the set $\{(u_i, v_j) : u, v \in T\}$
into two parts: (1) light couples with $|u_i v_j| \le \sqrt{d}/n$, which can be
bounded using Bernstein's inequality, and (2) heavy couples with
$|u_i v_j| > \sqrt{d}/n$ but satisfying bounded degree and discrepancy
properties. These two parts make up the variation in $u^T A v$, which will
lead to the bound in Lemma 2.

Following [16], the first step is to reduce the set of vectors to a finite,
yet exponentially large, space. Let $S = \{u \perp \mathbf{1} : \|u\| \le 1\}$ and, for some
fixed $0 < \epsilon < 1$, define a grid which approximates $S$:

$$T = \left\{ x \in \left( \frac{\epsilon}{\sqrt{n}} \mathbb{Z} \right)^n : \sum_i x_i = 0, \ \|x\| \le 1 \right\}.$$

Claim [16]. The number of vectors in $T$ is bounded by $e^{c_\epsilon n}$ for some $c_\epsilon$
which depends on $\epsilon$. If for every $u, v \in T$, $u^T A v \le c$, then for every
$x \in S$, $x^T A x \le c/(1 - \epsilon)^2$.

It remains to show that

Claim. $\exists c$, almost surely, $\forall u, v \in T$, $u^T A v \le c \sqrt{2m/n}$.

To prove this claim, we divide the set $\{(u_i, v_j) : u, v \in T\}$ into two parts:

(1) light couples with $|u_i v_j| \le \sqrt{d}/n$ and

(2) heavy couples with $|u_i v_j| > \sqrt{d}/n$.

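Let

\begin{align*}
L_k &= u_{i_k} v_{j_k} 1_{\{|u_{i_k} v_{j_k}| \le \sqrt{d}/n\}} + u_{j_k} v_{i_k} 1_{\{|u_{j_k} v_{i_k}| \le \sqrt{d}/n\}}, \\
H_k &= u_{i_k} v_{j_k} 1_{\{|u_{i_k} v_{j_k}| > \sqrt{d}/n\}} + u_{j_k} v_{i_k} 1_{\{|u_{j_k} v_{i_k}| > \sqrt{d}/n\}}.
\end{align*}

Then

$$u^T A v = \sum_{k=1}^{m} L_k + \sum_{k=1}^{m} H_k.$$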
Bound on the contribution of light couples. It's easy to compute

$$E L_k + E H_k = \frac{2}{n(n-1)} \sum_{i \ne j} u_i v_j = -\frac{2}{n(n-1)} \langle u, v \rangle,$$

\begin{align*}
|E H_k| &= \frac{2}{n(n-1)} \Big| \sum_{i \ne j} u_i v_j 1_{\{|u_i v_j| > \sqrt{d}/n\}} \Big|
  \le \frac{2}{n(n-1)} \sum_{i \ne j} \frac{|u_i v_j|^2}{\sqrt{d}/n} 1_{\{|u_i v_j| > \sqrt{d}/n\}} \\
 &\le \frac{2}{(n-1)\sqrt{d}} \sum_{i,j : |u_i v_j| > \sqrt{d}/n} u_i^2 v_j^2 \\
 &\le \frac{2}{(n-1)\sqrt{d}} \sum_i u_i^2 \sum_j v_j^2
  \le \frac{2}{(n-1)\sqrt{d}}.
\end{align*}

So $|E \sum_{k=1}^{m} L_k| \le m |E H_k| + m |E L_k + E H_k| = O(\sqrt{d})$, as $d = 2m/n$. We also
have that


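\begin{align*}
Var(L_k) \le E(L_k)^2
 &\le 2 E\Big( \big(u_{i_k} v_{j_k} 1_{\{|u_{i_k} v_{j_k}| \le \sqrt{d}/n\}}\big)^2 + \big(u_{j_k} v_{i_k} 1_{\{|u_{j_k} v_{i_k}| \le \sqrt{d}/n\}}\big)^2 \Big) \\
 &\le \frac{4}{n(n-1)} \sum_{i \ne j} u_i^2 v_j^2 \\
 &= \frac{4}{n(n-1)} \sum_i u_i^2 \Big( \sum_{j \ne i} v_j^2 \Big)
  \le \frac{4}{n(n-1)}.
\end{align*}

From the definition, $|L_k| \le 2\sqrt{d}/n$, so $|L_k - E L_k| \le |L_k| + |E L_k| \le 4\sqrt{d}/n =: M$. Then
Bernstein's inequality gives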

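\begin{align*}
P\Big( \sum_{k=1}^{m} L_k - E \sum_{k=1}^{m} L_k > c\sqrt{d} \Big)
 &\le \exp\Big( -\frac{c^2 d}{2m\,Var(L_k) + 2Mc\sqrt{d}/3} \Big) \\
 &\le \exp\Big( -\frac{c^2 d}{\frac{4d}{n-1} + \frac{8cd}{3n}} \Big) \\
 &\le \exp(-O(cn)).
\end{align*}

So taking a union bound over $u, v \in T$, the contribution of light couples is
bounded by $c\sqrt{d}$ with probability at least $1 - e^{-O(n)}$.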

Bound on the contribution of heavy couples. As shown in [16], if the
random graph satisfies the bounded degree and discrepancy properties, then
the contribution of heavy couples is bounded by $O(\sqrt{d})$. We next define
these two properties and prove that they hold.

Bounded degree property. We say the bounded degree property holds if
every vertex has a degree bounded by $c_1 d$ (for some $c_1 > 1$). Using the fact
$d_i \sim B(m, \mu)$, $\mu = 2/n$, together with Lemma 3 and $m > n \log n/2$, we have

$$P(d_i > 6m/n) \le e^{-m\mu(3 \log(3) - 2)} \le n^{-2}.$$

So taking a union bound over $i$, we get with probability at least $1 - 1/n$,
$\forall i$, $d_i \le 3d$.

Discrepancy property. Let $A, B \subset [n]$ be disjoint and $e(A,B)$ be a random
variable which denotes the number of edges between $A$ and $B$. Then
$e(A,B) \sim B\big(m, |A| \cdot |B| / \binom{n}{2}\big)$. So $\mu(A,B) = p|A| \cdot |B|$, with
$p = m/\binom{n}{2}$, is the expected value of $e(A,B)$. Let
$\lambda(A,B) = e(A,B)/\mu(A,B)$. We say that the discrepancy property holds if
there exists a constant $c$ such that for all $A, B \subset [n]$ with $|B| \ge |A|$ one
of the following holds:

1. $\lambda(A,B) \le 4$,

2. $e(A,B) \cdot \log \lambda(A,B) \le c \cdot |B| \cdot \log \frac{n}{|B|}$.

We will show that the discrepancy property holds with probability of at least
$1 - 2/n$. Write $a = |A|$, $b = |B|$, and suppose $b \ge a$. We assume the bounded
degree property holds with $c_1 = 3$ here.

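Case 1: $b > 3n/4$. Then $\mu(A,B) = ab \cdot \frac{2m}{n(n-1)} = ab \cdot \frac{d}{n-1} \ge \frac{3}{4} a d$, while
$e(A,B) \le a \cdot 3d$ as each vertex in $A$ has degree bounded by $3d$, so
$\lambda(A,B) \le 4$.

Case 2: $b \le 3n/4$. Using the fact $e(A,B) \sim B(m, q)$, with $q = ab/\binom{n}{2}$, and
Lemma 3, we have for $k \ge 2$,

$$P(e(A,B) \ge k\mu(A,B)) \le e^{-mq(k \log(k) + (1 - k))} \le e^{-\mu(A,B) k \log(k)/4}.$$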


Then the union bound over all $A, B$ with size $a, b$ for $e(A,B) \ge k\mu(A,B)$ is

$$\binom{n}{a} \binom{n}{b} e^{-\mu(A,B) k \log(k)/4} \le \left(\frac{ne}{a}\right)^a \left(\frac{ne}{b}\right)^b e^{-\mu(A,B) k \log(k)/4}. \tag{A.1}$$

We want the right hand side of (A.1) to be smaller than $1/n^3$, so it is enough
to let

$$\mu(A,B) k \log(k)/4 \ge a\left(1 + \log\frac{n}{a}\right) + b\left(1 + \log\frac{n}{b}\right) + 3 \log n. \tag{A.2}$$


Next, we give an upper bound for the right hand side of (A.2). Using
the fact that $x \log(n/x)$ is increasing in $(0, n/e)$ and decreasing in
$(n/e, 3n/4)$, we have for $b \le n/e$,

$$a\left(1 + \log\frac{n}{a}\right) + b\left(1 + \log\frac{n}{b}\right) + 3 \log n \le 4b \log\frac{n}{b} + 3 \log n \le 7b \log\frac{n}{b},$$

and for $3n/4 \ge b > n/e$,

\begin{align*}
a\left(1 + \log\frac{n}{a}\right) + b\left(1 + \log\frac{n}{b}\right) + 3 \log n
 &\le 2 \cdot 3n/4 + 2 \cdot n/e + 3 \log n \\
 &\le 11 \cdot 3/4 \cdot \log(4/3)\, n \\
 &\le 11 b \log\frac{n}{b}.
\end{align*}

Therefore to make (A.2) valid, it suffices to assume
$k \log k \ge \frac{44}{\mu(A,B)} b \log\frac{n}{b}$. Let $k_0 \ge 2$ be the minimal number that
satisfies this inequality. Using the union bound over all the possible $a, b$,
we get the following conclusion. With probability of at least $1 - 1/n$, for
every choice of $A, B$ ($b \le 3n/4$) the following holds,

$$e(A,B) \le k_0 \mu(A,B).$$

If $k_0 = 2$ then we are done, otherwise $k_0 \log(k_0) \mu(A,B) = O(1) b \log\frac{n}{b}$, so

$$e(A,B) \cdot \log \lambda(A,B) \le e(A,B) \log(k_0) \le k_0 \log(k_0) \mu(A,B) = O(1) |B| \log\frac{n}{|B|},$$

as desired to satisfy the second condition for the discrepancy property. □


Proof of Theorem 1. The proof for $G(n,m)$ follows from [15], which
establishes the result for the Erdős-Rényi random model $G(n,p)$. Here we
follow the same idea for $G_0(n,m)$, using an exponential concentration
inequality (Lemma 3) to derive lower and upper bounds on $d_{\min}$. The lower
bound follows directly from a union bound. The upper bound deals with weak
dependence using the Chebyshev-Markov inequality.

Using Lemma 1, we only need to study the asymptotic limit of $d_{\min}$. As
$d_i \sim B(m, \mu)$, $\mu = 2/n$, using Lemma 3 we have

$$P(d_i \le 2am/n) \le e^{-\frac{2m}{n}(a \log(a) + (1 - a))} = e^{\frac{2m}{n} H(a)}, \tag{A.3}$$

where $H(a) = a - a \log(a) - 1$.






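In the other direction, suppose $i_0 = 2am/n$ is an integer. We have

\begin{align*}
P(d_i \le 2am/n) &\ge \binom{m}{i_0} (2/n)^{i_0} (1 - 2/n)^{m - i_0} \\
 &\ge \frac{(m - i_0)^{i_0}}{e\sqrt{i_0}\,(i_0/e)^{i_0}} \,(2/(n-2))^{i_0} (1 - 2/n)^{m} \\
 &= \frac{1}{e\sqrt{i_0}} e^{i_0(1 + \log((n/a - 2)/(n - 2))) + m \log(1 - 2/n)} \\
 &\ge \frac{1}{e\sqrt{i_0}} e^{i_0(1 - \log(a)) - 2m/n - 4m/n^2} \tag{A.4} \\
 &\ge \frac{1}{e^3\sqrt{i_0}} e^{i_0(1 - \log(a) - 1/a)} \\
 &= \frac{1}{e^3\sqrt{a \frac{2m}{n}}} e^{\frac{2m}{n} H(a)}.
\end{align*}

The inequality in (A.4) follows from $\log(1 - x) \ge -x - x^2$ for all $x \in [0, 2/3]$
and the assumption that $n \ge 3$. The last inequality is due to $m \le n^2/2$.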
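
The upper bound (A.3) and the lower bound (A.4) can be compared with the exact binomial probability for small parameters. The sketch below is an illustration only: it assumes $d_i \sim B(m, 2/n)$ as in the text, restricts to integer $i_0 = 2am/n$ with $a \le 1$, and uses hypothetical function names.

```python
import math

def H(a):
    """H(a) = a - a*log(a) - 1, as defined after (A.3)."""
    return a - a * math.log(a) - 1.0

def degree_cdf(m, n, i0):
    """Exact P(d_i <= i0) for d_i ~ Binomial(m, 2/n)."""
    q = 2.0 / n
    return sum(math.comb(m, s) * q ** s * (1.0 - q) ** (m - s)
               for s in range(i0 + 1))

def degree_bounds(m, n, i0):
    """Return (lower, upper) from (A.4) and (A.3) with a = i0 * n / (2m)."""
    a = i0 * n / (2.0 * m)
    core = math.exp((2.0 * m / n) * H(a))
    lower = core / (math.e ** 3 * math.sqrt(i0))
    upper = core
    return lower, upper
```

For instance, with n = 16 and m = 64 the mean degree is 8, and for each integer $i_0$ from 1 to 8 the exact probability should fall between the two bounds.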

If $2am/n$ is not an integer, we can still have

$$P(d_i \le 2am/n) \ge \frac{c}{\sqrt{2m/n}} e^{\frac{2m}{n} H(a)}. \tag{A.5}$$


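Equations (A.3) and (A.5) give an estimate for $P(d_i \le 2am/n)$.
Now let $a_\pm = a(p_0) \pm 1/\sqrt{2m/n}$. Taylor's theorem gives

$$H(a_\pm) = H(a(p_0)) \pm \frac{H'(\tilde{a}_\pm)}{\sqrt{2m/n}},$$

where $\tilde{a}_+ \in (a(p_0), a(p_0) + 1/\sqrt{2m/n})$, $\tilde{a}_- \in (a(p_0) - 1/\sqrt{2m/n}, a(p_0))$, and
$H'(a) = -\log(a)$ is the derivative of $H$. Note $p_0 H(a(p_0)) = -1$, so

$$P\left( d_{\min} \le a(p_0) \frac{2m}{n} - \sqrt{\frac{2m}{n}} \right) \le n e^{\frac{2m}{n} H(a_-)} = e^{-\sqrt{2m/n}\, H'(\tilde{a}_-)} = O\big(e^{-\Omega(\sqrt{2m/n})}\big).$$

Therefore, with probability at least $1 - O\big(e^{-\Omega(\sqrt{2m/n})}\big)$,

$$d_{\min} \ge a(p_0) \frac{2m}{n} - \sqrt{\frac{2m}{n}}. \tag{A.6}$$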
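
The quantity $a(p_0)$ is defined only implicitly through $p_0 H(a(p_0)) = -1$, so a short numerical sketch may help. It assumes $p_0 = 2m/(n \log n) > 1$ (the normalization consistent with the union bound above), solves for the root in (0,1) by bisection, and uses hypothetical function names.

```python
import math

def H(a):
    """H(a) = a - a*log(a) - 1; increasing from -1 to 0 on (0, 1]."""
    return a - a * math.log(a) - 1.0

def a_of_p0(p0, tol=1e-12):
    """Root a in (0, 1) of p0 * H(a) = -1, found by bisection (requires p0 > 1)."""
    if p0 <= 1.0:
        raise ValueError("p0 must exceed 1 for a root in (0, 1)")
    target = -1.0 / p0
    lo, hi = 1e-12, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if H(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def dmin_window(n, m):
    """Interval a(p0)*d -/+ sqrt(d) with d = 2m/n, the window appearing in (A.6)."""
    d = 2.0 * m / n
    p0 = d / math.log(n)  # assumed normalization p0 = 2m / (n log n)
    a = a_of_p0(p0)
    return a * d - math.sqrt(d), a * d + math.sqrt(d)
```

Bisection is used because $H$ is monotone on (0,1), which makes the root unique and the search robust; the window is an asymptotic statement, so for small n it is only indicative.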
Now we prove the reverse direction. Let $f_n = P(d_i \le a_+ \frac{2m}{n})$,
$X_i = 1_{\{d_i \le a_+ \frac{2m}{n}\}}$ and $N_0 = \sum_{i=1}^{n} X_i$. So $\mu_0 = E N_0 = n f_n$. Chebyshev's
inequality implies that

$$P(|N_0 - \mu_0| > n f_n/2) \le \frac{4\,Var(N_0)}{n^2 f_n^2}. \tag{A.7}$$

$$Var(N_0) = \sum_{i=1}^{n} Var(X_i) + 2 \sum_{i<j} Cov(X_i, X_j) = n f_n (1 - f_n) + n(n-1)\,Cov(X_1, X_2).$$

Next, we claim

$$Cov(X_1, X_2) \le O(1) p f_n^2,$$

i.e. $P(X_1 = 1, X_2 = 1) \le (1 + O(1)p) f_n^2$.

It is enough to prove that $\forall k \le a_+ \frac{2m}{n}$,

$$P\left(d_1 \le a_+ \tfrac{2m}{n} \,\middle|\, d_2 = k\right) \le (1 + O(1)p) f_n.$$

Note the conditional distribution of $d_1$ given $d_2 = k$ is

$$B(k, 1/(n-1)) + B(m - k, 2/(n-1)),$$
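so

\begin{align*}
P\left(d_1 \le a_+ \tfrac{2m}{n} \,\middle|\, d_2 = k\right)
 &= \sum_{i=0}^{k} \sum_{j=0}^{a_+ \frac{2m}{n} - i} \binom{k}{i} \left(\tfrac{1}{n-1}\right)^{i} \left(\tfrac{n-2}{n-1}\right)^{k-i} \binom{m-k}{j} \left(\tfrac{2}{n-1}\right)^{j} \left(\tfrac{n-3}{n-1}\right)^{m-k-j} \\
 &= \sum_{s=0}^{a_+ \frac{2m}{n}} \sum_{i=0}^{k \wedge s} \binom{k}{i} \binom{m-k}{s-i} \left(\tfrac{1}{n-1}\right)^{i} \left(\tfrac{2}{n-1}\right)^{s-i} \left(\tfrac{n-2}{n-1}\right)^{k-i} \left(\tfrac{n-3}{n-1}\right)^{m-k-s+i} \\
 &\le \sum_{s=0}^{a_+ \frac{2m}{n}} \sum_{i=0}^{k \wedge s} \binom{k}{i} \binom{m-k}{s-i} \left(\tfrac{2}{n-1}\right)^{s} \left(\tfrac{n-2}{n-1}\right)^{k-i} \left(\tfrac{n-3}{n-1}\right)^{m-k-s+i} \\
 &\le \sum_{s=0}^{a_+ \frac{2m}{n}} \binom{m}{s} \left(\tfrac{n}{n-1}\right)^{k+s} \left(\tfrac{2}{n}\right)^{s} \left(\tfrac{n-2}{n}\right)^{m-s} \\
 &\le \left(\tfrac{n}{n-1}\right)^{2 a_+ \frac{2m}{n}} \sum_{s=0}^{a_+ \frac{2m}{n}} \binom{m}{s} \left(\tfrac{2}{n}\right)^{s} \left(\tfrac{n-2}{n}\right)^{m-s} \\
 &= (1 + O(1)p) f_n.
\end{align*}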
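
One way to see the last equality is the following sketch. It assumes that $p = m/\binom{n}{2} = 2m/(n(n-1))$ denotes the edge density with $0 \le p \le 1$ (an assumption here, since the definition is not restated at this point) and that $a_+$ stays bounded:

```latex
\begin{align*}
\left(\frac{n}{n-1}\right)^{2a_+\frac{2m}{n}}
  = \left(1 + \frac{1}{n-1}\right)^{\frac{4a_+ m}{n}}
  \le \exp\!\left(\frac{4a_+ m}{n(n-1)}\right)
  = e^{2a_+ p}
  \le 1 + \left(e^{2a_+} - 1\right) p .
\end{align*}
```

The first inequality uses $1 + x \le e^x$, and the last uses convexity of $x \mapsto e^{2a_+ x}$ on $[0,1]$. The remaining sum is the binomial probability $P(d_1 \le a_+ \frac{2m}{n}) = f_n$.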

Hence, we get $Var(N_0) \le n f_n + O(1) n^2 f_n^2 p$, and (A.7) gives

$$P(|N_0 - \mu_0| > n f_n/2) \le \frac{4}{n f_n} + 4\,O(1) p.$$

Note that $n f_n \ge \frac{c}{\sqrt{2m/n}} e^{\sqrt{2m/n}\, H'(\tilde{a}_+)} \to \infty$, so with probability at least
$1 - O\big(e^{-\Omega(\sqrt{2m/n})}\big)$, the graph has at least $n f_n/2 \to \infty$ vertices satisfying

$$d_i \le a(p_0) \frac{2m}{n} + \sqrt{\frac{2m}{n}}.$$

Clearly $d_{\min}$ also satisfies this statement. Combining this result with (A.6),
we have that with probability at least $1 - O\big(e^{-\Omega(\sqrt{2m/n})}\big)$,

$$\left| d_{\min} - a(p_0) \frac{2m}{n} \right| \le \sqrt{\frac{2m}{n}},$$

as desired.

□